Part 1: Dataset Selection and Justification¶

The selected dataset contains salary data for jobs in Artificial Intelligence (AI). Much consideration was given to dataset selection; however, this dataset is personally interesting and far less morbid than others considered, such as the Scottish Government's datasets on road casualties, healthcare, and poverty.

The dataset is a suitable size for this project. I experimented with some larger datasets, such as the UK road casualties data (found here: https://www.data.gov.uk/dataset/cb7ae6f0-4be6-4935-9277-47e5ce24a11f/road-safety-data), which comprises several files that could support some really interesting analysis and visualisation. However, working with it slowed Jupyter Notebooks down significantly. I have therefore opted for a smaller dataset that is still interesting and also more relevant to my degree. It will allow for fruitful analysis as it contains several columns holding a mixture of qualitative and quantitative data. This will allow me to ask various questions of the dataset, such as:

  • What is the most common job in AI?
  • What is the highest salary overall?
  • What is the average salary for jobs in AI?
  • Where are the majority of jobs in AI?

These questions will provide a deeper insight into the dataset. They may reveal a trend over time, or they may reveal surprising relationships. I would expect, for example, salaries to have risen steadily over recent years given the rapid advancement of the field. It is more difficult to predict, without exploratory analysis, where the majority of AI jobs are located.

The link to the selected dataset is: https://github.com/plotly/datasets/blob/master/salaries-ai-jobs-net.csv

Part 2: Data Cleaning and Transformation¶

Data Understanding¶

The dataset is accessed through the URL provided above. There is a single URL, which links to one table with several rows and columns. The dataset is loaded into a pandas DataFrame below. Then, the number of rows and columns in the dataset is displayed, along with its first 5 rows.

In [1]:
import pandas as pd
In [2]:
df = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/salaries-ai-jobs-net.csv')
In [3]:
print("Salaries dataset includes", df.shape[0], "rows and", df.shape[1], "columns")
Salaries dataset includes 637 rows and 11 columns
In [4]:
df.head()
Out[4]:
work_year experience_level employment_type job_title salary salary_currency salary_in_usd employee_residence remote_ratio company_location company_size
0 2022 MI FT Data Analyst 90000 SGD 65950 SG 50 SG M
1 2022 MI FT AI Scientist 200000 USD 200000 US 100 US M
2 2022 EN FT Machine Learning Developer 180000 USD 180000 US 100 US L
3 2022 MI FT Data Scientist 153000 USD 153000 US 100 US L
4 2022 SE FT Data Engineer 210000 USD 210000 US 100 US M

From the above we can see that each row represents a single employee. There are 11 potential variables to be used in analysis.

Missing Data and Types of Features¶

We can inspect the data types of each feature through the info() function which displays each column along with the number of non-null values and its corresponding type. We can also make use of the isnull() function to check for missing data.

In [5]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 637 entries, 0 to 636
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   work_year           637 non-null    int64 
 1   experience_level    637 non-null    object
 2   employment_type     637 non-null    object
 3   job_title           637 non-null    object
 4   salary              637 non-null    int64 
 5   salary_currency     637 non-null    object
 6   salary_in_usd       637 non-null    int64 
 7   employee_residence  637 non-null    object
 8   remote_ratio        637 non-null    int64 
 9   company_location    637 non-null    object
 10  company_size        637 non-null    object
dtypes: int64(4), object(7)
memory usage: 54.9+ KB
In [6]:
df.isnull().sum()
Out[6]:
work_year             0
experience_level      0
employment_type       0
job_title             0
salary                0
salary_currency       0
salary_in_usd         0
employee_residence    0
remote_ratio          0
company_location      0
company_size          0
dtype: int64

Evidently, the dataset does not contain any missing values, so no action needs to be taken in this regard. If the dataset did include missing values, a decision would need to be made as to whether the affected rows should be removed or whether a value should be imputed. This decision depends on the context; there is no hard and fast rule. One option is to insert a sentinel value (such as -1) into fields with missing values in place of the null value.
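The two options discussed above can be sketched on a toy DataFrame (the column name and sentinel value here are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame with one missing salary, standing in for a dataset with gaps.
toy = pd.DataFrame({"salary": [90000, np.nan, 200000]})

# Option 1: remove rows containing missing values.
dropped = toy.dropna()

# Option 2: impute a sentinel value such as -1 in place of the null.
imputed = toy.fillna({"salary": -1})

print(len(dropped))                 # 2
print(imputed["salary"].tolist())   # [90000.0, -1.0, 200000.0]
```

Which option is appropriate depends on context: dropping rows loses information, while a sentinel like -1 must be excluded from statistics such as means.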

Extracting and Creating Features¶

From the information above, I will assume the 'salary' column records an employee's annual salary. It would also be helpful to see an employee's monthly salary at a glance. To add this to the dataset, we can divide the annual salary by 12 and store the result in a new feature.

In [7]:
df['monthly_salary'] = df['salary']/12
df['monthly_salary'] = df['monthly_salary'].astype('int64') #store as integers

Another part of data cleaning and transformation is standardising the data to ensure the data is prepared for analysis. We must first check what values a column contains. This is achieved through the unique() function.

In [8]:
def displayUniqueValues():
    for col in df.columns:
        print(col, df[col].unique(), "\n")
In [9]:
displayUniqueValues()
work_year [2022 2020 2021] 

experience_level ['MI' 'EN' 'SE' 'EX'] 

employment_type ['FT' 'PT' 'CT' 'FL'] 

job_title ['Data Analyst' 'AI Scientist' 'Machine Learning Developer'
 'Data Scientist' 'Data Engineer' 'Machine Learning Scientist'
 'Machine Learning Engineer' 'Data Science Manager' 'ML Engineer'
 'Data Analytics Manager' 'ETL Developer' 'Lead Data Engineer'
 'Data Architect' 'Head of Machine Learning' 'Data Science Engineer'
 'Head of Data Science' 'Analytics Engineer' 'Machine Learning Manager'
 'Director of Data Science' 'NLP Engineer' 'Business Data Analyst'
 'Machine Learning Infrastructure Engineer'
 'Applied Machine Learning Scientist' 'Applied Data Scientist'
 'Computer Vision Engineer' 'Head of Data' 'Research Scientist'
 'Principal Data Analyst' 'Product Data Analyst' 'Data Analytics Lead'
 'Data Engineering Manager' 'Data Analytics Engineer'
 'Principal Data Scientist' 'Computer Vision Software Engineer'
 'Financial Data Analyst' 'Lead Data Scientist'
 'Lead Machine Learning Engineer' 'Principal Data Engineer'
 'Big Data Engineer' 'BI Data Analyst' 'Data Science Consultant'
 '3D Computer Vision Researcher' 'Lead Data Analyst'
 'Marketing Data Analyst' 'Director of Data Engineering'
 'Cloud Data Engineer' 'Big Data Architect' 'Staff Data Scientist'
 'Finance Data Analyst' 'Data Specialist'] 

salary [   90000   200000   180000   153000   210000   100000   150075   110925
    22800   160000    92000   202900   131300    20000    15000   175000
   135000   193000    83000    75000    55000   186000   148800   112900
    90320   240000   300000    62500    95000   120000   145000   105400
    43200   215300   158200   209100   154600   115934    81666   155000
    80000   164000   132000   170000   123000   189650   164996    50000
   150000   165400   132320   208775   147800   136994   101570   128875
    93700  6000000    28500   183600   100800    40000    30000    70000
    60000   140400    45000   260000    35000    82900    63900   112300
   241000   159000    58000   136000   108800   242000   165220   120160
   124190   181940   220110   160080   126500   106260   116000    99000
   120600   130000   102100    84900   136620    99360   146000   110000
   161342   137141   167000   211500   138600   192400    90700    61300
   113000    95550   115500   243900   156600   136600   109280   224000
   167875   205300   176000   144000   200100    70500    54000   184700
   175100   140250   116150    99050    85000   214000   192600   266400
   213120   115000   141300   206699    99100   110500   192564   144854
   230000   150260    67000    52000   154000   126000   129000   140000
    69000    25000   105000   220000    65000   324000   216000   185100
   104890   157000   250000  1400000  2400000    53000   109000    88000
    10000    87000    66500    78000   121000    57000    48000   165000
    29000    69999    52800    59000   152500   405000   380000   177000
    62000  8500000  7000000   148000    24000    38400    82500    42000
  3000000   125000   700000     8760    51999   450000    41000   159500
    13400   103000    12000   400000   270000    68000   138000    45760
   235000   225000    44000  2250000    37456   106000 11000000    14000
    81000  2200000   276000   188000   174000    93000  2100000    51400
    61500   720000   108000    31000    52500    91000  1600000   256000
    72500   185000   112000    65720    72000   111775    93150    21600
  4900000  1250000   190000  1200000    21000  4000000  1799997     9272
   147000   120500    21844     4000    22000    76760  1672000   420000
 30400000   195000    32000   416000    40900  2500000     8000  4450000
   423000    56000   299000    98000   325000    34000   600000    69600
   435000    37000    19000    74000   152000    18000   102000    39600
  1335000  1450000    73000   190200   118000   138350   130800   168000
   412000   151000] 

salary_currency ['SGD' 'USD' 'EUR' 'AUD' 'GBP' 'CAD' 'INR' 'CNY' 'PLN' 'CHF' 'JPY' 'HUF'
 'MXN' 'TRY' 'CLP' 'DKK' 'BRL'] 

salary_in_usd [ 65950 200000 180000 153000 210000 100000 150075 110925  22800 160000
  92000 202900 131300  22809  15000 175000 135000 138851  83000  97341
  71383 186000 148800 112900  90320 240000 300000  68324 123298 120000
 145000 105400  49268 116809 215300 158200 209100 154600 115934  81666
 155000  87454 164000 132000 170000 123000 189650 164996  54659 117990
 165400 132320 208775 147800 136994 101570 128875  93700  78747  36989
 183600 100800  51915  38936  43727  32795  76522 103830  90851  77873
 140400  65591  49193 260000  45425  58404  51064  60000  82900  63900
 112300 241000 159000  80000  58000 136000 108800 242000  64894 165220
 120160 124190 181940 220110 160080 126500 106260 116000  99000 120600
 130000  90000 150000 102100  84900 136620  99360 146000 110000 161342
 137141 167000 211500 138600 192400  90700  61300 113000  95550 115500
 243900 156600 136600 109280 224000 167875 205300 176000 144000 200100
  70500  54000 184700 175100 140250 116150  99050  85000  75000 214000
 192600 266400 213120 115000 141300 206699  99100 110500 192564 144854
 230000 150260  67000  52000 154000 126000 129000 140000  69000  25000
 105000  50000 220000 181703  65000 324000 216000 185100 104890  78660
 117104 196650  37047  18374  31498  57938 157000  70794  71056 109000
  69221  10000  20000 102839  52309  78000  87052  40000  62311  54742
  92920  86332 165000  35372  31702  69999  57720  64497 152500 405000
 380000 121787 177000  67777  48000  77364  63711 161791  24000  38400
  82500  49646  40570 125000   9466  10354 110037  21863  82744  59303
  62649  82528  55000 250000  70000 130026  63831  68428 450000  46759
  74130 127221  13400  75774 103000  12000   5409 270000  54238  47282
 153667  28476  59102 138000  79197  45760  53192 235000  79833 225000
  76833  50180  88654 103160 113476  94564  30428 187442  51519 106000
 112872  36259  15966  95746  76958  89294  29751 276000 188000 174000
  93000  28399  60757  70139   6072  33511  96282  12103  36643  72212
  91000  99703 103691  21637  42000  63810 109024 256000  72500 185000
  69741 112000  20171  77684  72000  65013  28016 111775  93150  25532
  66265  16904 190000 141846  16228  71786  35735  24823  54094  24342
   9272 147000  96113  21844  51321  40481   4000  39916  87000  26005
  90734  22611   5679  81000  40038   2859  61467 195000  37825 416000
  56256  33808 116914  46597   8000  41689 114047   5707  56000  28609
  43331  47899  98000  66022  56738 325000  45896  40189 600000  12901
   5882  42197  62726  21669  87738  61896  74000 152000  18000  18907
 173762 148261  38776  46809  18053  91237  19609  62000  73000  45391
 190200 118000 138350 130800  45618 168000 119059 423000  28369 412000
 151000  94665] 

employee_residence ['SG' 'US' 'EG' 'PT' 'ID' 'AU' 'GB' 'DE' 'IN' 'FR' 'GR' 'CA' 'ES' 'IT'
 'AR' 'AE' 'BO' 'IE' 'SI' 'MY' 'JP' 'EE' 'NL' 'PK' 'BR' 'PL' 'HN' 'TN'
 'CZ' 'AT' 'CH' 'RU' 'DZ' 'VN' 'IQ' 'BE' 'UA' 'NG' 'BG' 'PH' 'HU' 'MX'
 'TR' 'JE' 'PR' 'RS' 'KE' 'CO' 'NZ' 'IR' 'RO' 'CL' 'DK' 'CN' 'HK' 'MD'
 'LU' 'HR' 'MT'] 

remote_ratio [ 50 100   0] 

company_location ['SG' 'US' 'EG' 'PT' 'ID' 'AU' 'GB' 'DE' 'GR' 'CA' 'IN' 'ES' 'IT' 'MX'
 'AE' 'IE' 'LU' 'SI' 'MY' 'EE' 'NL' 'FR' 'PL' 'HN' 'CZ' 'AT' 'CH' 'PK'
 'JP' 'DZ' 'BR' 'RO' 'IQ' 'BE' 'RU' 'UA' 'NG' 'DK' 'TR' 'CN' 'HU' 'KE'
 'CO' 'NZ' 'IR' 'CL' 'MD' 'VN' 'AS' 'HR' 'IL' 'MT'] 

company_size ['M' 'L' 'S'] 

monthly_salary [   7500   16666   15000   12750   17500    8333   12506    9243    1900
   13333    7666   16908   10941    1666    1250   14583   11250   16083
    6916    6250    4583   15500   12400    9408    7526   20000   25000
    5208    7916   10000   12083    8783    3600   17941   13183   17425
   12883    9661    6805   12916    6666   13666   11000   14166   10250
   15804   13749    4166   12500   13783   11026   17397   12316   11416
    8464   10739    7808  500000    2375   15300    8400    3333    2500
    5833    5000   11700    3750   21666    2916    6908    5325    9358
   20083   13250    4833   11333    9066   20166   13768   10013   10349
   15161   18342   13340   10541    8855    9666    8250   10050   10833
    8508    7075   11385    8280   12166    9166   13445   11428   13916
   17625   11550   16033    7558    5108    9416    7962    9625   20325
   13050   11383    9106   18666   13989   17108   14666   12000   16675
    5875    4500   15391   14591   11687    9679    8254    7083   17833
   16050   22200   17760    9583   11775   17224    8258    9208   16047
   12071   19166   12521    5583    4333   12833   10500   10750   11666
    5750    2083    8750   18333    5416   27000   18000   15425    8740
   13083   20833  116666  200000    4416    9083    7333     833    7250
    5541    6500   10083    4750    4000   13750    2416    4400    4916
   12708   33750   31666   14750    5166  708333  583333   12333    2000
    3200    6875    3500  250000   10416   58333     730   37500    3416
   13291    1116    8583    1000   33333   22500    5666   11500    3813
   19583   18750    3666  187500    3121    8833  916666    1166    6750
  183333   23000   15666   14500    7750  175000    4283    5125   60000
    9000    2583    4375    7583  133333   21333    6041   15416    9333
    5476    6000    9314    7762    1800  408333  104166   15833  100000
    1750  333333  149999     772   12250   10041    1820     333    1833
    6396  139333   35000 2533333   16250    2666   34666    3408  208333
     666  370833   35250    4666   24916    8166   27083    2833   50000
    5800   36250    3083    1583    6166   12666    1500    8500    3300
  111250  120833    6083   15850    9833   11529   10900   14000   34333
   12583] 

From the information above, there does not appear to be any bad data in the dataset: the values in each column conform to a common standard. Considering 'company_location', for example, the standard is an abbreviated country code; if a value in this column appeared as "Great Britain", it would be replaced with "GB" in order to standardise the data. The only point to note is that there are two very similar job titles that likely refer to the same role: 'ML Engineer' is probably the same job as 'Machine Learning Engineer'.

What is notable from the above information, though, is that there is no obvious index column. A unique identifier might be a single attribute, or it might be a combination of attributes. Being able to identify a single entity, such as an employee, is crucial in databases because it enables users to locate one unique record among many. If an employee leaves the organisation, for example, we want to maintain data integrity by removing, or hiding, the correct employee so that the ex-employee is not confused with someone else.

For this dataset, I will create an artificial unique identifier for the sole purpose of being able to identify a single employee. I will also replace 'ML Engineer' with 'Machine Learning Engineer'.

In [10]:
df.insert(0, 'employee_id', range(1, 1+len(df)))
In [11]:
df['job_title'] = df['job_title'].replace({'ML Engineer':'Machine Learning Engineer'})
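A quick sanity check, sketched here on a toy frame since the pattern is identical for df, confirms that the deprecated title no longer appears after the replacement:

```python
import pandas as pd

# Toy frame standing in for df; the titles are taken from the dataset.
toy = pd.DataFrame({"job_title": ["ML Engineer", "Data Scientist",
                                  "Machine Learning Engineer"]})
toy["job_title"] = toy["job_title"].replace(
    {"ML Engineer": "Machine Learning Engineer"})

# The old spelling should be gone and its rows merged into the new one.
assert "ML Engineer" not in toy["job_title"].unique()
print(toy["job_title"].value_counts().to_dict())
# {'Machine Learning Engineer': 2, 'Data Scientist': 1}
```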

To summarise this section, we have:

  • identified the number of potential features;
  • identified the number of missing values;
  • identified the type of each feature;
  • created a new variable using existing data;
  • considered what would have to be done if data were missing;
  • considered how to standardise the data; and
  • inserted an 'employee_id' column to enable users to identify each employee.

With the dataset in good condition to move forward, the next task is to conduct Exploratory Data Analysis (EDA).

Part 3: Exploratory Data Analysis (EDA)¶

In performing EDA, we can gain a better understanding of the dataset. One important aspect of EDA is understanding the distribution of variables; that is, the probability of a variable taking on a range of values. This can prevent problems arising later because it establishes whether the dataset contains outliers, large majority values, and flat and wide values.

A dataset that contains such occurrences can result in misleading information. For example, an outlier is a value which is very rare. This could indicate a data entry error. The rare value might be true - it simply depends on the context; however, care and consideration should be given to such values.

Furthermore, a large majority value can be misleading: although there appears to be a lot of data, most rows take the same value, so the apparent volume is deceptive.

In addition, care should be taken where variables are flat and wide. These are columns in which there are very few occurrences of each value. An example is an identifier variable: such values occur only once or twice in the dataset, so they are of little use when trying to detect relationships and trends. In this project, an example is the newly created 'employee_id' field, since each employee has a unique index number that, by design, occurs only once.
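One way to flag flat-and-wide variables programmatically is to compare a column's number of unique values against the number of rows; the function name and threshold below are my own illustrative choices, not part of the dataset:

```python
import pandas as pd

def flat_and_wide(series: pd.Series, threshold: float = 0.5) -> bool:
    """Flag a column whose unique-value count approaches the row count."""
    return series.nunique() / len(series) > threshold

# Toy frame: an identifier column versus a low-cardinality category.
toy = pd.DataFrame({
    "employee_id": range(1, 11),  # every value unique -> flat and wide
    "company_size": ["M", "L", "S", "M", "M", "L", "S", "M", "L", "M"],
})
print(flat_and_wide(toy["employee_id"]))   # True
print(flat_and_wide(toy["company_size"]))  # False
```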

This EDA will explore the dataset further and answer the questions outlined at the start of this notebook, either in part or in full. The EDA may also lead to more questions that add depth to the basic analysis. Prior to visualising distributions, some summary statistics are described below.

Summary Statistics¶

The describe() function is very helpful for calculating descriptive statistics of the dataset. Below we can see various statistics, such as the mean of each numerical column, along with the minimum and maximum values. The minimum (earliest) year is 2020 and the maximum is 2022, suggesting the dataset covers 2020 to 2022. We can also see that the lowest salary is 2,859 USD, the highest is 600,000 USD, and the average is roughly 113,275 USD.

In [12]:
df.describe()
Out[12]:
employee_id work_year salary salary_in_usd remote_ratio monthly_salary
count 637.000000 637.000000 6.370000e+02 637.000000 637.000000 6.370000e+02
mean 319.000000 2021.430141 3.151061e+05 113275.439560 70.879121 2.625854e+04
std 184.030342 0.689250 1.508096e+06 70874.620746 40.869244 1.256747e+05
min 1.000000 2020.000000 4.000000e+03 2859.000000 0.000000 3.330000e+02
25% 160.000000 2021.000000 7.000000e+04 63831.000000 50.000000 5.833000e+03
50% 319.000000 2022.000000 1.150000e+05 103000.000000 100.000000 9.583000e+03
75% 478.000000 2022.000000 1.650000e+05 150075.000000 100.000000 1.375000e+04
max 637.000000 2022.000000 3.040000e+07 600000.000000 100.000000 2.533333e+06

Variable Distributions¶

To visualise categorical variables, we can display the number of occurrences of each categorical value using seaborn's countplot(). Simply put, this visualisation "can be thought of as a histogram across a categorical, instead of quantitative, variable" [2]. I will utilise this to visualise the distribution of categorical variables.

To visualise numerical variables, it is appropriate to visualise distribution through a histogram, similar to a bar plot, where the axis that represents the variable in question is divided into a number of bins. Within each bin is the number of occurrences of each value the variable takes - the more frequent the occurrence, the higher the bin. Using seaborn, a histogram can be generated through displot() or histplot(). [3]

It should be noted that there is no data guide or data dictionary. Some assumptions are made with regard to what some values represent and these are made clear where applicable.

In [13]:
#required imports
import seaborn as sns
import matplotlib.pyplot as plt
In [14]:
df['work_year'].value_counts()
Out[14]:
2022    347
2021    217
2020     73
Name: work_year, dtype: int64
In [15]:
sns.countplot(df, x='work_year', palette='RdPu')
Out[15]:
<Axes: xlabel='work_year', ylabel='count'>

We can see from the above that the distribution of 'work_year' is not entirely imbalanced. The dataset mostly contains employees' work records from 2022, with somewhat fewer records from 2021 and significantly fewer from 2020.

In [16]:
df['experience_level'].value_counts()
Out[16]:
SE    299
MI    220
EN     92
EX     26
Name: experience_level, dtype: int64
In [17]:
sns.countplot(df, x='experience_level', palette='RdPu')
Out[17]:
<Axes: xlabel='experience_level', ylabel='count'>

Here I will assume:

  • MI means "Middle";
  • EN means "Entry";
  • SE means "Senior"; and
  • EX means "Executive".

We can see from the above chart that most employees in the dataset are Senior, with the next most common experience_level being "Middle". We then have a lower proportion of Entry level employees, and an even lower proportion of Executive employees.

In [18]:
df['employment_type'].value_counts()
Out[18]:
FT    618
PT     10
CT      5
FL      4
Name: employment_type, dtype: int64
In [19]:
sns.countplot(df, x='employment_type', palette='RdPu')
Out[19]:
<Axes: xlabel='employment_type', ylabel='count'>

Here I will assume:

  • FT means "Full Time";
  • PT means "Part Time";
  • CT means "Contract"; and
  • FL means "Freelance".

We can see from the above information that the vast majority of employees are employed on a full-time basis. The chart shows the distribution is heavily imbalanced, which raises the question of whether the few rows that are not "FT" should be kept or discarded. Given that we may wish to consider the employment types of AI careers further, it is useful to simply retain the 19 rows that are not "FT".
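If we ever did decide to restrict the analysis to full-time roles, the filter is a one-line boolean mask; the toy frame below stands in for df (on the real data the same mask would keep the 618 "FT" rows):

```python
import pandas as pd

# Toy employment types mirroring the categories in the dataset.
toy = pd.DataFrame({"employment_type": ["FT", "FT", "PT", "CT", "FL", "FT"]})

# Keep only the full-time rows.
full_time_only = toy[toy["employment_type"] == "FT"]
print(len(full_time_only))  # 3
```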

In [20]:
#to view full list
pd.set_option('display.max_rows', None)
In [21]:
df['job_title'].value_counts()
Out[21]:
Data Scientist                              148
Data Engineer                               139
Data Analyst                                104
Machine Learning Engineer                    52
Research Scientist                           16
Data Science Manager                         15
Data Architect                               11
Machine Learning Scientist                    9
AI Scientist                                  8
Big Data Engineer                             8
Data Science Consultant                       7
Data Analytics Manager                        7
Director of Data Science                      7
Principal Data Scientist                      7
BI Data Analyst                               6
Lead Data Engineer                            6
Computer Vision Engineer                      6
Data Engineering Manager                      5
Applied Data Scientist                        5
Head of Data                                  5
Business Data Analyst                         5
Machine Learning Developer                    4
Analytics Engineer                            4
Applied Machine Learning Scientist            4
Head of Data Science                          4
Data Analytics Engineer                       4
Machine Learning Infrastructure Engineer      3
Data Science Engineer                         3
Lead Data Analyst                             3
Principal Data Engineer                       3
Computer Vision Software Engineer             3
Lead Data Scientist                           3
Director of Data Engineering                  2
Product Data Analyst                          2
Principal Data Analyst                        2
ETL Developer                                 2
Financial Data Analyst                        2
Cloud Data Engineer                           2
Data Analytics Lead                           1
Finance Data Analyst                          1
Staff Data Scientist                          1
Big Data Architect                            1
3D Computer Vision Researcher                 1
Marketing Data Analyst                        1
NLP Engineer                                  1
Machine Learning Manager                      1
Head of Machine Learning                      1
Lead Machine Learning Engineer                1
Data Specialist                               1
Name: job_title, dtype: int64

From the information above, I would argue that it is not entirely suitable to visualise the full distribution of 'job_title', as many values occur only once or twice; this suggests the variable may be flat and wide. However, it may be useful to visualise the 10 most common jobs in AI. To do this, we can create a new Series containing this information. This approach visualises an answer to the question: what is the most common job in AI?

In [22]:
def getTop10Jobs():
    top_10_jobs = df['job_title'].value_counts()[:10] #slice the 10 most common titles
    return top_10_jobs
In [23]:
top_10_jobs = getTop10Jobs()
top_10_jobs
Out[23]:
Data Scientist                148
Data Engineer                 139
Data Analyst                  104
Machine Learning Engineer      52
Research Scientist             16
Data Science Manager           15
Data Architect                 11
Machine Learning Scientist      9
AI Scientist                    8
Big Data Engineer               8
Name: job_title, dtype: int64
In [24]:
def displayTop10Jobs(top_10):
    top_10.plot(kind='barh', color='plum') #use the parameter, not the global
In [25]:
displayTop10Jobs(top_10_jobs)

Evidently, 'Data Scientist' is the most common job in AI according to this dataset. The chart simply shows the number of times these job titles appear in the dataset. Most values are one of 'Machine Learning Engineer', 'Data Analyst', 'Data Engineer', or 'Data Scientist'.

In addition, we can easily address another question outlined earlier by making use of Python's statistics library. We have already identified the highest salary as 600,000 USD. To explicitly calculate the average salary of jobs in AI, we can invoke the mean() function on the 'salary_in_usd' column (the raw 'salary' column mixes currencies, so averaging it would be misleading).

In [26]:
import statistics
In [27]:
print("Average salary of jobs in AI (in USD):", int(statistics.mean(df['salary_in_usd'])))
Average salary of jobs in AI (in USD): 113275

It is useful to know where the majority of jobs in AI are located. As such, I inspect the variables 'employee_residence' and 'company_location' more closely below.

In [28]:
df['employee_residence'].value_counts()
Out[28]:
US    354
GB     46
IN     30
CA     29
DE     26
FR     18
ES     15
GR     13
PT      7
JP      7
PK      6
BR      6
NL      5
IT      4
RU      4
AU      4
PL      4
VN      3
TR      3
SG      3
AT      3
AE      3
HU      2
NG      2
BE      2
RO      2
DK      2
MX      2
SI      2
CN      1
HK      1
MD      1
JE      1
LU      1
IR      1
NZ      1
HR      1
CO      1
KE      1
RS      1
PR      1
CL      1
EE      1
EG      1
MY      1
PH      1
BG      1
UA      1
IQ      1
ID      1
DZ      1
AR      1
CH      1
CZ      1
TN      1
HN      1
BO      1
IE      1
MT      1
Name: employee_residence, dtype: int64

This is a similar situation to the distribution of 'job_title': many 'employee_residence' values occur only once or twice, suggesting a flat and wide distribution. As such, I will visualise only the 15 most common 'employee_residence' values.

In [29]:
#store the 15 most common employee_residences
top_15 = df['employee_residence'].value_counts()[:15]
In [30]:
top_15
Out[30]:
US    354
GB     46
IN     30
CA     29
DE     26
FR     18
ES     15
GR     13
PT      7
JP      7
PK      6
BR      6
NL      5
IT      4
RU      4
Name: employee_residence, dtype: int64
In [31]:
top_15.plot(kind='barh', color='plum')
Out[31]:
<Axes: >

The chart above shows that the vast majority of employees reside in the US.

In [32]:
df['company_location'].value_counts()
Out[32]:
US    377
GB     49
CA     30
DE     29
IN     24
FR     15
ES     14
GR     11
JP      6
PT      5
AT      4
AU      4
PL      4
NL      4
BR      3
PK      3
MX      3
LU      3
AE      3
DK      3
TR      3
RU      2
BE      2
NG      2
CN      2
SG      2
CH      2
CZ      2
SI      2
IT      2
KE      1
IL      1
HR      1
AS      1
VN      1
MD      1
CL      1
IR      1
NZ      1
CO      1
EG      1
HU      1
HN      1
ID      1
IE      1
UA      1
MY      1
IQ      1
RO      1
EE      1
DZ      1
MT      1
Name: company_location, dtype: int64
In [33]:
#store the 15 most common company_locations
top_15_company_locations = df['company_location'].value_counts()[:15]
In [34]:
top_15_company_locations
Out[34]:
US    377
GB     49
CA     30
DE     29
IN     24
FR     15
ES     14
GR     11
JP      6
PT      5
AT      4
AU      4
PL      4
NL      4
BR      3
Name: company_location, dtype: int64
In [35]:
top_15_company_locations.plot(kind='barh', color='purple')
Out[35]:
<Axes: >

The chart above shows that the companies employing these workers are mostly located in the US.

In [36]:
df['company_size'].value_counts()
Out[36]:
M    346
L    207
S     84
Name: company_size, dtype: int64
In [37]:
sns.countplot(df, x='company_size', palette='RdPu')
Out[37]:
<Axes: xlabel='company_size', ylabel='count'>

Here I assume:

  • S means "Small";
  • M means "Medium"; and
  • L means "Large".

We can see from the chart above that the distribution of 'company_size' values is reasonably spread across the three categories. In addressing the question where are the majority of jobs in AI?, this visualisation answers it in part. If "where" refers to companies rather than locations, it is clear that most employees work in a medium-sized company. A high number also work in large companies, while fewer than 100 employees work in small companies.

We explore where the majority of jobs in AI are above, but do not consider employees who work remotely. The distribution of 'remote_ratio' is presented below.

In [38]:
df['remote_ratio'].value_counts()
Out[38]:
100    401
0      135
50     101
Name: remote_ratio, dtype: int64
In [39]:
sns.countplot(df, x='remote_ratio', palette='RdPu')
Out[39]:
<Axes: xlabel='remote_ratio', ylabel='count'>

Here I will assume:

  • 0 means no remote work;
  • 50 means some remote work; and
  • 100 means the employee works entirely remotely.

We can see that a large number of employees work remotely. The number of employees that do not work remotely at all and the number of employees who work remotely some of the time is fairly even.

Correlation Matrix¶

So far we have considered single variables on their own. Before moving on to Part 4: Data Visualisation, an important part of EDA is evaluating relationships/correlations between pairs of variables. For example, we have seen that most employees reside in the US and that most companies are located there. A correlation matrix is an extremely useful tool that allows us to assess the strength of the relationship between any two numeric variables (categorical columns such as employee residence and company location require other techniques).

One major advantage of using Python to analyse data is the support offered by its libraries. This makes implementing the correlation matrix significantly simpler and less time consuming. We can generate a correlation matrix by calling the DataFrame's corr() method, and we can generate a visual representation of it through the seaborn and matplotlib libraries. It is an effective technique because it is not only easy to implement, but also easy for stakeholders and analysts alike to interpret:

  • the closer a coefficient is to 1 or -1, the stronger the (positive or negative) relationship; and
  • the closer it is to 0, the weaker the relationship.
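These rules of thumb can be illustrated with a toy DataFrame (the column names and values below are made up for illustration, not taken from the dataset):

```python
import pandas as pd

# Toy frame for illustration only: 'y' rises in step with 'x',
# while 'z' falls as 'x' rises
toy = pd.DataFrame({
    'x': [1, 2, 3, 4, 5],
    'y': [2, 4, 6, 8, 10],    # perfectly positively related to x
    'z': [10, 8, 6, 4, 2],    # perfectly negatively related to x
})

corr = toy.corr()
print(round(corr.loc['x', 'y'], 3))  # 1.0  -> strong positive relationship
print(round(corr.loc['x', 'z'], 3))  # -1.0 -> strong negative relationship
```

A coefficient near 0 would instead indicate no linear relationship between the pair.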
In [40]:
#the plain correlation matrix (numeric columns only, to avoid the
#FutureWarning about the default value of numeric_only)
correlation_matrix = df.corr(numeric_only=True)
correlation_matrix
Out[40]:
employee_id work_year salary salary_in_usd remote_ratio monthly_salary
employee_id 1.000000 -0.772597 0.106603 -0.165687 -0.006387 0.106603
work_year -0.772597 1.000000 -0.089774 0.180503 0.068606 -0.089774
salary 0.106603 -0.089774 1.000000 -0.081715 -0.013806 1.000000
salary_in_usd -0.165687 0.180503 -0.081715 1.000000 0.131238 -0.081714
remote_ratio -0.006387 0.068606 -0.013806 0.131238 1.000000 -0.013806
monthly_salary 0.106603 -0.089774 1.000000 -0.081714 -0.013806 1.000000

In visualising the correlation matrix, the code below was adapted from W3Schools [4].

In [41]:
import matplotlib.pyplot as plt
In [42]:
#visualise the correlation matrix
#following code adapted from W3Schools
axis_corr = sns.heatmap(
    correlation_matrix, #use our correlation_matrix
    vmin = -1, vmax = 1, center = 0, #set min to -1, max to 1, center to 0
    cmap = sns.color_palette('RdPu'), #define colours
    square = True #show me the squares
)

plt.show()

Part 4: Data Visualisation¶

We have a lot of information gathered now about our dataset and have found answers to the questions outlined at the start of the notebook. We can now go further to graphically represent any trends and patterns in the dataset, and to visualise how variables relate to one another.

Box Plot¶

The box plot below is effective in visualising the distribution of salaries across the three-year period. It allows users to easily compare statistics such as the median of each group (in this case, each year). It also neatly presents outliers as dots outside of the box.

Violin Plot¶

So far, this project has made use of line plots, bar plots, and pie charts to visualise the salary data. In visualising salaries by year, a box plot would do the job nicely. However, it is also suitable to explore visualising this data using a violin plot, which effectively presents the relationship between two variables. Of course, the box plot can describe basic distributions; however, the violin plot is really a combination of the box plot and the kernel density plot. The violin plot therefore has advantages over the simple box plot, as it presents summary statistics along with each variable's density.

Below, there are two violin plots. Each uses and presents the same data, with one utilising Seaborn and the other Plotly. The violin plot created with Seaborn is, to me, easier to look at as it shows very clearly the median as a white dot, along with the interquartile range and 1.5x interquartile range through the box plot inside the violin. The violin plot produced using Plotly is more engaging, as a user can hover over it and inspect the data further.

Salaries from 2020 to 2022¶

In [43]:
import plotly.express as px
In [44]:
sns.boxplot(x='work_year', y='salary_in_usd', data=df, palette='RdPu')
Out[44]:
<Axes: xlabel='work_year', ylabel='salary_in_usd'>
In [45]:
fig = px.box(df, x='work_year', y='salary_in_usd')
fig.update_traces(line_color='purple')
fig.show()
In [46]:
def salByYearViolin():
    f, ax = plt.subplots(figsize=(8, 8))

    # Show each distribution as a violin with an inner box plot
    sns.violinplot(x='work_year', y='salary_in_usd', data=df, inner='box', cut=2, linewidth=3, palette='RdPu')

    sns.despine(left=True)

    f.suptitle('Salary by Year', fontsize=18, fontweight='bold')
    ax.set_xlabel('Year', size=16, alpha=0.7)
    ax.set_ylabel('Salary in USD', size=16, alpha=0.7)
In [47]:
salByYearViolin()
In [48]:
def salByYearViolinHover():
    fig = px.violin(df, y='salary_in_usd', x='work_year')

    fig.update_layout(
        title="Salaries by Year",
        yaxis_title="Salaries (USD)",
        xaxis_title="Year"
    )
    
    fig.update_traces(line_color='purple')

    fig.show()
In [49]:
salByYearViolinHover()

Experience Level and Salaries¶

As a student, it is always interesting to find out how salaries can increase with more experience. We can visualise this by plotting salaries against experience_level.

In [50]:
sns.boxplot(x='experience_level', y='salary_in_usd', data=df, palette='RdPu')
Out[50]:
<Axes: xlabel='experience_level', ylabel='salary_in_usd'>
In [51]:
def salByExpLevelBox():
    fig = px.box(df, x='experience_level', y='salary_in_usd')
    fig.update_traces(line_color='purple')
    fig.show()
In [52]:
salByExpLevelBox()
In [53]:
plt.figure(figsize=(15,9))
sns.kdeplot(data=df, x='salary_in_usd', hue='experience_level', fill=False, linewidth=5)
plt.title("Distribution of Salary by Experience Level", fontsize=20)
plt.xlabel("Salary (in USD)", fontsize=18)
plt.ylabel("Density", fontsize=18)
plt.xticks(fontsize=16)
plt.yticks(fontsize=16)
plt.show()

The kernel density plot above shows, as expected, that employees who are more experienced earn significantly more than those with less experience. The green line shows that Senior employees in AI earn the highest salaries. What is surprising in this kernel density plot, and perhaps very misleading, is that employees of experience level "EX" (which I have assumed is one of the highest levels of experience) appear to earn a lower salary overall. However, this can be explained if we recall the variable distribution of 'experience_level': the number of employees at this level is significantly lower than at the other levels. Indeed, the dataset contains only 26 rows with the value "EX". The kernel density plot above is therefore perhaps not a completely true reflection of reality.
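One way to guard against being misled by a thinly supported group is to report each group's size alongside its mean. The sketch below does this on a hypothetical miniature of the data (the toy rows and salaries are illustrative only, not taken from the dataset):

```python
import pandas as pd

# Hypothetical miniature of the salary data: 'EX' has far fewer rows,
# so its average rests on thin evidence (values are illustrative only)
toy = pd.DataFrame({
    'experience_level': ['EN', 'EN', 'MI', 'MI', 'SE', 'SE', 'SE', 'EX'],
    'salary_in_usd':    [50000, 60000, 80000, 90000, 140000, 150000, 160000, 120000],
})

# Pairing each group's mean with its count flags averages backed by few rows
summary = toy.groupby('experience_level')['salary_in_usd'].agg(['count', 'mean'])
print(summary)
```

On the real DataFrame the same pattern (`df.groupby('experience_level')['salary_in_usd'].agg(['count', 'mean'])`) would show the 26 "EX" rows next to their mean, making the small sample explicit.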

Salaries According to Job Title¶

It is encouraging to see that the median salary for experience_level "EN" is a tidy sum of 56.36k. What would also be interesting to find out is the average salary for each job title. To investigate this, a pivot table can help. A pivot table is a significantly less time-consuming way to calculate an aggregate (here, the mean salary) for each group. The result can then be displayed nicely using a bar chart, which can be made more user-friendly and readable by formatting the labels horizontally. This will allow users to glance at the plot and quickly locate information.

In [54]:
def createSalsByJobPivotTable():
    #set index to job_title, use salary_in_usd as values
    #set aggfunc to mean as we want to see each job's average salary
    salaries_by_job_pivot_table = pd.pivot_table(data=df,index=['job_title'],
                                                 values=['salary_in_usd'], 
                                                 aggfunc='mean').sort_values(by=['salary_in_usd'], ascending=False)
    return salaries_by_job_pivot_table
In [55]:
salaries_by_job_pivot_table = createSalsByJobPivotTable()
salaries_by_job_pivot_table #For testing purposes
Out[55]:
salary_in_usd
job_title
Data Analytics Lead 405000.000000
Principal Data Engineer 328333.333333
Financial Data Analyst 275000.000000
Principal Data Scientist 215116.285714
Director of Data Science 195027.000000
Data Architect 177873.909091
Applied Data Scientist 175655.000000
Analytics Engineer 175000.000000
Data Science Manager 169252.866667
Data Specialist 165000.000000
Head of Data 160126.800000
Director of Data Engineering 156738.000000
Head of Data Science 146718.750000
Machine Learning Scientist 143344.444444
Applied Machine Learning Scientist 142025.500000
Lead Data Engineer 139691.666667
Data Analytics Manager 127134.285714
Cloud Data Engineer 124647.000000
Data Engineering Manager 123227.200000
Principal Data Analyst 122500.000000
Machine Learning Manager 117104.000000
Lead Data Scientist 115190.000000
Data Engineer 113119.223022
Machine Learning Engineer 111893.230769
Data Scientist 109480.628378
Machine Learning Developer 109330.000000
Research Scientist 108965.812500
Computer Vision Software Engineer 105248.666667
Staff Data Scientist 105000.000000
Machine Learning Infrastructure Engineer 101039.333333
Big Data Architect 99703.000000
Lead Data Analyst 92203.000000
Data Analyst 92077.701923
Marketing Data Analyst 88654.000000
Lead Machine Learning Engineer 87454.000000
AI Scientist 82868.625000
Head of Machine Learning 78747.000000
Business Data Analyst 76654.000000
Data Science Engineer 75803.333333
BI Data Analyst 74755.166667
Data Science Consultant 69420.714286
Data Analytics Engineer 64799.250000
Finance Data Analyst 61896.000000
ETL Developer 54659.000000
Big Data Engineer 51974.000000
Computer Vision Engineer 44419.333333
NLP Engineer 37047.000000
Product Data Analyst 13036.000000
3D Computer Vision Researcher 5409.000000
In [56]:
def displayTop20AvgSals(pivot_table):
    #supply argument 20 to head() to display only 20 average salaries
    avg_sals = pivot_table.head(20).plot(kind='barh', color='plum')
    avg_sals.tick_params(axis='x', labelrotation = 45)
In [57]:
displayTop20AvgSals(salaries_by_job_pivot_table)
In [58]:
def displayWorst20AvgSals(pivot_table):
    #supply argument 20 to tail() to display only the 20 lowest average salaries
    #use the pivot_table parameter rather than the global variable
    avg_sals = pivot_table.tail(20).plot(kind='barh', color='purple')
    avg_sals.tick_params(axis='x', labelrotation = 45)
In [59]:
displayWorst20AvgSals(salaries_by_job_pivot_table)

Now, we can see clearly that the highest average salary belongs to 'Data Analytics Lead'. The lowest average salary belongs to '3D Computer Vision Researcher'.

Salaries by Location¶

Remote work is not always an option, particularly for entry-level positions. I would like to inspect salaries according to company location. To do this, I will make use of a bar plot where the x-axis represents the country and the y-axis represents the average salary. We must first extract a subset of data from the DataFrame: group the dataset by 'company_location', find the average 'salary_in_usd' for each group, and then sort the data from highest to lowest.

In [60]:
def getAvgSalsByCompLocation():
    #extract the data
    avg_sals_comp_location = df.groupby('company_location')[['salary_in_usd']].mean().sort_values('salary_in_usd', ascending = False)
    return avg_sals_comp_location
In [61]:
def displayAvgSalsByCompLocation(avg_sals_comp_location):
    #take the top 25 once rather than calling head(25) three times
    top_25 = avg_sals_comp_location.head(25)
    fig = px.bar(x=top_25.index,
                 y=top_25.salary_in_usd,
                 color=top_25.salary_in_usd,
                 color_continuous_scale=px.colors.sequential.Purp)

    fig.update_layout(
        title="Top 25 Average Salaries by Company Location",
        yaxis_title="Salary (USD)",
        xaxis_title="Country Code"
    )

    fig.show()
In [62]:
avg_sals_comp_location = getAvgSalsByCompLocation()
avg_sals_comp_location
Out[62]:
salary_in_usd
company_location
RU 157500.000000
US 144988.458886
NZ 125000.000000
IL 119059.000000
AU 115558.750000
JP 114127.333333
DZ 100000.000000
IQ 100000.000000
AE 100000.000000
CA 99786.800000
BE 85699.000000
DE 81334.965517
GB 81308.857143
SG 77622.000000
AT 72832.750000
CN 71665.500000
IE 71056.000000
PL 66028.000000
FR 63912.200000
CH 63834.500000
SI 63831.000000
RO 60000.000000
NL 54860.750000
DK 54386.333333
ES 52921.571429
GR 52062.545455
CZ 50850.500000
HR 45618.000000
LU 43942.666667
PT 42709.400000
CL 40038.000000
MY 40000.000000
IT 36366.500000
HU 35735.000000
EE 32795.000000
MX 32123.333333
NG 30000.000000
IN 28559.041667
MT 28369.000000
EG 22800.000000
CO 21844.000000
TR 20096.666667
HN 20000.000000
BR 18602.666667
AS 18053.000000
MD 18000.000000
ID 15000.000000
UA 13400.000000
PK 13333.333333
KE 9272.000000
IR 4000.000000
VN 4000.000000
In [63]:
displayAvgSalsByCompLocation(avg_sals_comp_location)

Remote Working by Year¶

But money is not always the top priority for many employees. For some, being able to work in one's home environment is important to their work-life balance. It would be useful to be able to see if the potential for home-working is realistic or not.

In [64]:
#for readability
df['remote_ratio'] = df['remote_ratio'].replace({0:'No remote work', 50:'Some remote work', 100:'Fully remote'})
In [65]:
def displayRemoteWorkingByYear():
    fig = px.histogram(df, x='remote_ratio', color='work_year', barmode='group',
                      category_orders={
                          'remote_ratio': ['No remote work', 'Some remote work', 'Fully remote'],
                          'work_year': [2020, 2021, 2022]
                      },
                      text_auto=True, #display values in bars
                      color_discrete_sequence=['rebeccapurple', 'plum', 'darkorchid'] # color of histogram bars
                      )

    fig.update_layout(
        title="Remote Working 2020-2022",
        yaxis_title="Number of Employees",
        xaxis_title="Remote Ratio"
    )

    fig.show()
In [66]:
displayRemoteWorkingByYear()
In [67]:
!pip install wordcloud
In [68]:
#code to construct a wordcloud adapted from GeeksforGeeks [5]
from wordcloud import WordCloud, STOPWORDS
In [69]:
from collections import Counter
In [70]:
def buildWordCloud():    
    #store job titles in Counter class object 
    words=Counter(df.job_title)

    #Configure the wordcloud
    wordcloud = WordCloud(width = 5000, height = 3500, background_color='black', colormap='RdPu',
        collocations= False, #exclude collocations of two words
        stopwords=STOPWORDS, #used to remove common words such as conjunctions and pronouns
    )

    #Create the word cloud using job titles and their frequencies
    wordcloud.generate_from_frequencies(words)
    return wordcloud

    
In [71]:
def displayWordcloud():
    wordcloud = buildWordCloud()
    #Set overall size
    plt.figure(figsize=(25,20))
    #Don't display axes
    plt.axis("off")
    #imshow() function to display wordcloud data as an image
    plt.imshow(wordcloud)

Part 5: Data Storytelling¶

Why Consider a Career in AI?¶

Whether you are in your senior years of high school, just starting out in your career, or thinking of taking a completely new direction, there's a job in AI for you. Read on to find out why you should consider a career in the AI industry.

It's no secret that technology has grown exponentially in recent decades. For many, Artificial Intelligence is a mystery, but it's been around for longer than some think. AI has become far more prominent recently because computing equipment is far less expensive than it was in the 20th Century, and machines can now remember significantly more information than they could in the 1950s. You might not realise it, but you're probably interacting with AI every day - from self-serve checkouts to chats with Alexa, and from fraud detection to 3D printed prostheses. The bottom line is, AI career possibilities are endless. The number of jobs in AI is increasing rapidly, and there's no sign of this stopping. We need all sorts of people to keep the world running, and these are just some of the jobs that you could be doing in AI:

In [72]:
displayWordcloud()

So what's in it for you?¶

Working in AI isn't just for coders sitting in a dark room for hours on end. It's a diverse field, and there's probably something for everyone.

It's important you're rewarded for your hard work, and having a decent income will give you financial security. From our data, we've calculated the highest salary to be 600,000 USD, and the average to be 113,275 USD. And that's the average across a three-year period.

In [73]:
salByYearViolin()

Look closely at the chart. The white dot represents the median salary for that year. Take a closer look below by hovering over the chart.

In [74]:
salByYearViolinHover()

The median salary rises from 74.13k in 2020 to 82.528k in 2021, and to 120.16k in 2022. The same can't be said for all industries. Even if you're just starting out in AI, entry-level positions offer a very good pay package:

In [75]:
salByExpLevelBox()

What this shows you is that the median salary for entry-level positions is 56.36k. And the more experienced you become, the higher the salary you are likely to receive. With around 5-10 years of experience working in AI, the median salary rises to 76.522k.

What job pays the most?¶

Hopefully, you can see that careers in AI are financially rewarding. But with so many job titles being thrown around, how do we know which jobs pay the most? Look below. These are the top 20 average salaries by job title.

In [76]:
displayTop20AvgSals(salaries_by_job_pivot_table)

You'll see that some of the highest average salaries belong to jobs in data science. But how many opportunities are there for these particular jobs? Is there a high demand for them?

In [77]:
displayTop10Jobs(top_10_jobs)

Indeed, most jobs are in Data Science too. But maybe money isn't your top concern. If you get fed up sitting in the same office all the time, there's the opportunity to travel the world and work in another country.

Do you want to travel?¶

Do you want to travel the world as part of your job? Then AI gives you that opportunity. Even if the money isn't that important to you, you'll need a decent income to support yourself wherever you are. In AI, you get the best of both worlds - good pay and a chance to see the world.

In [78]:
displayAvgSalsByCompLocation(avg_sals_comp_location)

Maybe Russia's not high on your list of countries to visit, but it's clear that the average salary is high in all of the countries in the chart. This isn't surprising given their hefty investment in AI. In fact, Russia's AI budget is thought to be approximately 4 billion rubles, equivalent to approximately 60.2 million USD [6]. Despite the war with Ukraine, President Vladimir Putin is determined to prevent countries in the West from obtaining a monopoly over AI. It's widely reported that the US and China are at the forefront of AI development, and many believe this development is going to "transform the world and revolutionise society in a way similar to the introduction of computers in the 20th century" [7].

So if you'd prefer a life in the sun in Spain, or something cooler in Denmark, you can expect an average salary of 52.92k or 54.39k respectively which, although at the lower end, is still not to be sniffed at.

Rather work at home?¶

Remote working's absolutely an option in AI. In fact, the Covid-19 pandemic has resulted in huge changes in how people work. Yes, the companies paying the higher average salaries might be located in Russia, the US, and New Zealand. But the data shows us that remote working is on an upward trend.

In [79]:
displayRemoteWorkingByYear()

It's striking that in 2022, our data shows there were 247 employees working on a fully remote basis. This has risen from 37 employees working fully remotely in 2020. It's highly likely this is a result of the pandemic. When the pandemic first took hold of the world, a lot of companies were not ready to move to fully remote working, but by 2022 these companies had had time to prepare and move to remote working, hence the clear increase from 2020 (37) to 2021 (117) to 2022 (247).

So whether you're a homebird or desperate to move country, whether you're bothered about money or not, there's plenty of opportunities in AI.

Part 6: Challenges and Reflection¶

The analysis undertaken in this project presented various challenges, both on a personal level and on a programmatic level.

The most notable challenge occurred early in the project. Initially, a dataset pertaining to road accidents in the UK was selected (as was mentioned in Part 1). This was significantly larger than the AI salaries dataset. It contained three CSV files with approximately 500,000 rows. All was going well until Data Visualisation, where Jupyter Notebook slowed down considerably with each plot that was added. Jupyter Notebook froze completely and my own laptop simply isn't powerful enough to work with that size of dataset. Fortunately, I make an effort to manage my time well, so there was plenty of time to start afresh with a different, smaller dataset. Although very frustrating and disappointing (it was very interesting to work with real-world data), it has certainly been a learning curve in the sense that I will consider the specification of my equipment when selecting datasets in the future.

A positive side of this happening is that working with a smaller dataset has shown what could be done with real-world datasets. For example, being able to extract certain data from a dataset given a condition and display complex information in a very organised and presentable manner. I have learned that data analysis can really guide decisions in the real world. If we can turn rows and rows of data into useful information, decision-makers can act accordingly: if data analysis and visualisation highlights an upward trend in road accidents, for instance, this could trigger safety campaigns and driving test reviews.

There have been challenges relating to programming too. At first, I could not get my head around how a pivot table worked. However, after spending some time on this I can see now how beneficial they are in data visualisation. To have a single table that can quickly transform data in our desired way must be quite powerful in real-life data science. For example, I imagine this would have been very useful at the peak of the Covid-19 pandemic when analysts wanted to summarise large amounts of data that would later be presented to a diverse audience, many without scientific expertise.

Another learning curve of this assignment has been, quite simply, learning Python. I started my degree with absolutely no programming experience, and have primarily learned Java and SQL throughout years 1 and 2. The prospect of learning another language was quite daunting at first. However, this assignment has really shown me how powerful Python is. It is a language I would like to become more competent at using because it is clear how usable it is in the real world.

Overall, this has been an insightful assignment. Although it is disappointing I could not pursue this assignment using the road accident dataset, it has still presented valuable learning opportunities that have given me a deeper insight into real-world data science tasks.

Bibliography¶

[1] “datasets/salaries-ai-jobs-net.csv at master · plotly/datasets,” GitHub. https://github.com/plotly/datasets/blob/master/salaries-ai-jobs-net.csv (accessed November-December, 2023)

[2] M. Waskom, “seaborn.countplot — seaborn 0.9.0 documentation,” Pydata.org, 2012. https://seaborn.pydata.org/generated/seaborn.countplot.html

[3] “Visualizing the distribution of a dataset — seaborn 0.9.0 documentation,” Pydata.org, 2012. https://seaborn.pydata.org/tutorial/distributions.html

[4] “Data Science Statistics Correlation Matrix,” www.w3schools.com. https://www.w3schools.com/datascience/ds_stat_correlation_matrix.asp

[5] S. Kadam, “Generating Word Cloud in Python,” GeeksforGeeks, May 11, 2018. https://www.geeksforgeeks.org/generating-word-cloud-python/

[6] Samuel Bendett et al., “Artificial Intelligence, China, Russia, and the Global Order Technological, Political, Global, and Creative Perspectives,” Oct. 2019. https://www.jstor.org/stable/pdf/resrep19585.28.pdf

[7] G. Faulconbridge, “Putin says West cannot have AI monopoly so Russia must up its game,” Reuters, Nov. 24, 2023. Available: https://www.reuters.com/technology/putin-approve-new-ai-strategy-calls-boost-supercomputers-2023-11-24/